Univariate Data Embedding
  • Generate Univariate IoT Data:
    • The function generate_univariate_iot_data() creates a small dataset of 11 smart meter readings taken every 30 minutes over a five-hour window (00:00 to 05:00 on 2024-01-01). Each entry has a unique smart_meter_device_ID, a reading_timestamp, and a random meter_reading drawn uniformly between 980 and 1050.
  • Embed Univariate IoT Data:
    • The function embed_univariate_iot_data() uses the SentenceTransformer model paraphrase-MiniLM-L6-v2 to generate a 384-dimensional embedding for each row. Each row is first serialized into a single string containing the device ID, timestamp, and meter reading, and that string is what gets embedded.
  • Store Embeddings in FAISS:
    • The embeddings are stored in a FAISS index of type faiss.IndexFlatL2, which performs exact (brute-force) nearest-neighbor search and reports squared L2 distances.
  • Query Search:
    • The function retrieve_similar_data() allows querying the vector database with a specific query (like a device ID or meter reading). The query is embedded and compared to the stored embeddings to retrieve the top k=3 most similar data points.
  • Displaying Results:
    • The display_markdown() function prints a DataFrame as a markdown table via the to_markdown() method. It is used for both the original data and the retrieved rows, which the code labels as "predicted" data.
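    • A multi-group spider (radar) chart, drawn with matplotlib, visualizes the retrieved rows. Its four axes are the min-max scaled meter_reading, embedding_distance, smart_meter_device_ID (as a category code), and reading_timestamp (as Unix seconds); each retrieved row is drawn as one polygon.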
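To make the storage and query steps concrete, here is a minimal, dependency-free sketch of the computation an exact L2 index performs: score every stored vector against the query and keep the k smallest distances. The helper name brute_force_l2_search and the toy 2-D vectors are assumptions made for illustration and are not part of the notebook; FAISS does the equivalent work on the 384-dimensional embeddings far more efficiently.

```python
import heapq


def brute_force_l2_search(stored, query, k=3):
    """Return (indices, squared_distances) of the k stored vectors nearest to query.

    Hypothetical helper for illustration only; it mirrors what an exact
    (flat) L2 index computes, using squared Euclidean distance.
    """
    scored = []
    for idx, vec in enumerate(stored):
        # Squared L2 distance: sum of squared per-dimension differences
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    # Keep the k smallest distances, closest first
    nearest = heapq.nsmallest(k, scored)
    indices = [idx for _, idx in nearest]
    distances = [dist for dist, _ in nearest]
    return indices, distances


# Toy 2-D "embeddings" (made-up values, not real model output)
stored_vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 0.0]]
indices, distances = brute_force_l2_search(stored_vectors, [1.0, 0.5], k=3)
print(indices, distances)
```

The returned indices play the same role as the I array from index.search(), and the distances the role of D.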
In [ ]:
%pip install -q faiss-cpu sentence-transformers pandas numpy matplotlib
Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
petastorm 0.12.1 requires pyspark>=2.1.0, which is not installed.
databricks-feature-store 0.14.3 requires pyspark<4,>=3.1.2, which is not installed.
ydata-profiling 4.2.0 requires numpy<1.24,>=1.16.0, but you have numpy 2.1.3 which is incompatible.
ydata-profiling 4.2.0 requires scipy<1.11,>=1.4.1, but you have scipy 1.14.1 which is incompatible.
numba 0.55.1 requires numpy<1.22,>=1.18, but you have numpy 2.1.3 which is incompatible.
mleap 0.20.0 requires scikit-learn<0.23.0,>=0.22.0, but you have scikit-learn 1.1.1 which is incompatible.
langchain 0.0.217 requires numpy<2,>=1, but you have numpy 2.1.3 which is incompatible.
databricks-feature-store 0.14.3 requires numpy<2,>=1.19.2, but you have numpy 2.1.3 which is incompatible.
In [ ]:
import numpy as np
import pandas as pd
import faiss
from sentence_transformers import SentenceTransformer
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

# Step 1: Generate univariate IoT time series data (smart meter data)
def generate_univariate_iot_data():
    np.random.seed(0)  # For reproducibility
    start_time = pd.to_datetime('2024-01-01')
    # Create 11 timestamps at 30-minute intervals, spanning 5 hours (00:00 to 05:00)
    time_index = pd.date_range(start=start_time, periods=11, freq='30min')
    
    # Generate random meter readings
    meter_readings = np.random.uniform(980, 1050, size=11)  # Meter reading values
    
    # Simulate smart meter device IDs (e.g., device IDs could be unique)
    smart_meter_device_IDs = [f"device_{i}" for i in range(1, 12)]
    
    # Create DataFrame
    data = {
        'smart_meter_device_ID': smart_meter_device_IDs,
        'reading_timestamp': time_index,
        'meter_reading': meter_readings
    }
    
    return pd.DataFrame(data)

# Step 2: Embed univariate IoT data using SentenceTransformer
def embed_univariate_iot_data(df):
    model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
    
    # Create a string representation of each row to feed into the model
    texts = df.apply(lambda row: f"device: {row['smart_meter_device_ID']} timestamp: {row['reading_timestamp']} reading: {row['meter_reading']}", axis=1).tolist()

    # Create embeddings
    embeddings = model.encode(texts)
    
    return embeddings

# Step 3: Store embeddings in FAISS vector database
def store_embeddings_in_faiss(embeddings):
    embeddings = np.array(embeddings).astype('float32')
    index = faiss.IndexFlatL2(embeddings.shape[1])  # L2 distance metric
    index.add(embeddings)
    return index

# Step 4: Retrieval step of a Retrieval-Augmented Generation (RAG) system: query the index
def retrieve_similar_data(query, index, k=3):
    # The query must be embedded with the same model that embedded the stored rows
    model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
    query_embedding = model.encode([query]).astype('float32')
    # D holds squared L2 distances, I holds row indices of the nearest neighbors (closest first)
    D, I = index.search(query_embedding, k)
    return I[0], D[0]

# Step 5: Display original data and predicted data in markdown format
def display_markdown(df, title="Data"):
    print(f"\n{title}:\n")
    print(df.to_markdown(index=False))

# Step 6: Visualize predicted data as a multi-group spider (radar) chart
def plot_multi_group_spider_chart(data, title):
    # Convert the "smart_meter_device_ID" to numeric category codes
    device_ids = data['smart_meter_device_ID'].astype('category').cat.codes.values
    # Convert the "reading_timestamp" to Unix time in seconds
    timestamps = data['reading_timestamp'].astype('int64') / 1e9

    # Combine all four axes into one matrix and min-max scale each column to [0, 1] once
    full_values = np.column_stack([
        data[['meter_reading', 'embedding_distance']].values,
        device_ids,
        timestamps,
    ])
    full_scaled_values = MinMaxScaler().fit_transform(full_values)
    
    # Setup the radar chart categories
    categories = ['meter_reading', 'embedding_distance', 'smart_meter_device_ID', 'reading_timestamp']
    
    # Create angles for the radar chart
    num_vars = len(categories)
    angles = np.linspace(0, 2 * np.pi, num_vars, endpoint=False).tolist()
    angles += angles[:1]  # Close the loop
    
    # Create the figure for the plot
    fig, ax = plt.subplots(figsize=(5, 5), dpi=80, subplot_kw=dict(polar=True))
    
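    # Plot each row (predicted data) on the radar chart; enumerate supplies the
    # positional index into full_scaled_values regardless of the DataFrame index
    for pos, (_, row) in enumerate(data.iterrows()):
        row_values = np.concatenate([full_scaled_values[pos], full_scaled_values[pos][:1]])  # Close the loop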
        ax.fill(angles, row_values, alpha=0.25)
        ax.plot(angles, row_values, label=f'{row["smart_meter_device_ID"]}', linewidth=2)
    
    # Set up the labels and ticks
    ax.set_yticklabels([])  # Hide radial labels
    ax.set_xticks(angles[:-1])  # Set the x-ticks to be the categories
    ax.set_xticklabels(categories, fontsize=12)  # Set the labels of each axis
    
    # Add a legend and title
    ax.legend(loc='upper right', bbox_to_anchor=(1.2, 1.1), fontsize=12)
    plt.title(title, size=14)
    plt.show()

# Main execution
if __name__ == "__main__":
    # Generate univariate IoT data
    df = generate_univariate_iot_data()
    
    # Embed the univariate IoT data
    embeddings = embed_univariate_iot_data(df)
    
    # Store the embeddings in FAISS
    index = store_embeddings_in_faiss(embeddings)
    
    # Display the original univariate IoT data
    display_markdown(df, "Univariate IoT Data")

    # Display the first 5 dimensions of the first 5 embeddings of the original data
    embedding_df = pd.DataFrame(embeddings)
    print(f"\nEmbeddings of the Univariate IoT Data: (first 5 dimensions)")
    print(f" {len(embedding_df.columns)} vector dimensions were created\n")
    print(embedding_df.iloc[:,:5].head(5).to_markdown(index=False))
  
    # User Query: Let's query by a specific element (device ID, timestamp, or reading)
    user_query = "smart_meter_device_ID : device_7 reading: 1010"
    print(f"\nUser Query: {user_query}\n")

    # Retrieve top 3 best matches based on the query
    indices, distances = retrieve_similar_data(user_query, index)

    # Collect the top 3 retrieved rows (called "predicted" data in the output)
    predicted_df = df.iloc[indices].reset_index(drop=True)
    predicted_df['embedding_distance'] = distances
    predicted_df = predicted_df.sort_values(by='embedding_distance', ascending=True).reset_index(drop=True)
    
    # Display predicted univariate IoT data
    display_markdown(predicted_df, "Predicted Univariate IoT Data")

    # Display the first 5 dimensions of the embeddings of the retrieved rows
    predicted_embeddings = np.array(embeddings)[indices]
    predicted_embedding_df = pd.DataFrame(predicted_embeddings)
    print(f"\nFirst 5 Embeddings of the Predicted Univariate IoT Data:\n")
    print(predicted_embedding_df.iloc[:,:5].head(5).to_markdown(index=False))

    # Display multi-group spider chart for predicted data
    plot_multi_group_spider_chart(predicted_df, "Predicted Univariate IoT Data")
Univariate IoT Data:

| smart_meter_device_ID   | reading_timestamp   |   meter_reading |
|:------------------------|:--------------------|----------------:|
| device_1                | 2024-01-01 00:00:00 |         1018.42 |
| device_2                | 2024-01-01 00:30:00 |         1030.06 |
| device_3                | 2024-01-01 01:00:00 |         1022.19 |
| device_4                | 2024-01-01 01:30:00 |         1018.14 |
| device_5                | 2024-01-01 02:00:00 |         1009.66 |
| device_6                | 2024-01-01 02:30:00 |         1025.21 |
| device_7                | 2024-01-01 03:00:00 |         1010.63 |
| device_8                | 2024-01-01 03:30:00 |         1042.42 |
| device_9                | 2024-01-01 04:00:00 |         1047.46 |
| device_10               | 2024-01-01 04:30:00 |         1006.84 |
| device_11               | 2024-01-01 05:00:00 |         1035.42 |

Embeddings of the Univariate IoT Data: (first 5 dimensions)
 384 vector dimensions were created

|         0 |          1 |        2 |         3 |         4 |
|----------:|-----------:|---------:|----------:|----------:|
| -0.365002 | -0.0616462 | 0.136118 | -0.23923  | -0.274961 |
| -0.297714 | -0.0963958 | 0.108235 | -0.247195 | -0.207936 |
| -0.329856 | -0.096536  | 0.110829 | -0.283669 | -0.217568 |
| -0.235997 | -0.0913811 | 0.148862 | -0.315279 | -0.333508 |
| -0.242079 | -0.196711  | 0.16983  | -0.240875 | -0.133976 |

User Query: smart_meter_device_ID : device_7 reading: 1010
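
Note that this query string does not follow the template that embed_univariate_iot_data() uses to serialize the stored rows. The sketch below reproduces that template in a small helper, row_to_text (a hypothetical name introduced here), so the strings can be compared side by side. Whether phrasing queries in the stored template brings them closer in embedding space is an assumption that this notebook does not test.

```python
def row_to_text(device_id, timestamp, reading):
    """Serialize one reading with the same template the embedding step uses.

    Hypothetical helper: the notebook builds this string inline with a lambda.
    """
    return f"device: {device_id} timestamp: {timestamp} reading: {reading}"


# Stored form of the device_7 row (reading rounded as in the printed table)
stored_text = row_to_text("device_7", "2024-01-01 03:00:00", 1010.63)

# The query exactly as issued in the notebook
user_query = "smart_meter_device_ID : device_7 reading: 1010"

# A query phrased with the stored template (illustrative assumption)
templated_query = row_to_text("device_7", "2024-01-01 03:00:00", 1010)

print(stored_text)
print(user_query)
print(templated_query)
```

In the notebook itself the reading is embedded at full floating-point precision, not rounded to two decimals as in this sketch.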


Predicted Univariate IoT Data:

| smart_meter_device_ID   | reading_timestamp   |   meter_reading |   embedding_distance |
|:------------------------|:--------------------|----------------:|---------------------:|
| device_7                | 2024-01-01 03:00:00 |         1010.63 |              26.3377 |
| device_1                | 2024-01-01 00:00:00 |         1018.42 |              26.4215 |
| device_4                | 2024-01-01 01:30:00 |         1018.14 |              27.0364 |
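
The embedding_distance values in this table come straight from index.search(), and faiss.IndexFlatL2 reports squared L2 distances. As a small worked example, taking the square root converts the three printed values to ordinary Euclidean distances; the variable names are illustrative.

```python
import math

# Squared L2 distances copied from the table above
squared_distances = {
    "device_7": 26.3377,
    "device_1": 26.4215,
    "device_4": 27.0364,
}

# Euclidean distance is the square root of the squared L2 distance
euclidean = {dev: math.sqrt(d2) for dev, d2 in squared_distances.items()}

for dev, dist in euclidean.items():
    print(f"{dev}: {dist:.2f}")
```

The ranking is unchanged because the square root is monotonic. The three distances are close to one another, so the match on device_7 is only marginally better than the next candidates.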

First 5 Embeddings of the Predicted Univariate IoT Data:

|         0 |          1 |        2 |         3 |         4 |
|----------:|-----------:|---------:|----------:|----------:|
| -0.287904 | -0.130074  | 0.127292 | -0.229564 | -0.270398 |
| -0.365002 | -0.0616462 | 0.136118 | -0.23923  | -0.274961 |
| -0.235997 | -0.0913811 | 0.148862 | -0.315279 | -0.333508 |
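[Figure: multi-group radar chart titled "Predicted Univariate IoT Data", with axes meter_reading, embedding_distance, smart_meter_device_ID, and reading_timestamp, and one polygon per retrieved device]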
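
The radar chart plots min-max scaled values, so each axis runs from 0 (the smallest value among the retrieved rows) to 1 (the largest). Below is a minimal pure-Python sketch of that scaling, applied to the meter_reading column of the three retrieved rows; min_max_scale is a hypothetical stand-in for scikit-learn's MinMaxScaler with its default (0, 1) range.

```python
def min_max_scale(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1.

    Hypothetical helper; a constant column maps to all zeros here
    to avoid dividing by zero.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# meter_reading of the retrieved rows: device_7, device_1, device_4
readings = [1010.63, 1018.42, 1018.14]
scaled = min_max_scale(readings)
print([round(s, 3) for s in scaled])
```

With only three rows, and barring ties, every axis has one polygon vertex at 0 and one at 1, so the chart conveys the relative ordering of the retrieved rows, not their absolute magnitudes.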